{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "toc_visible": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Retrieval from OpenAI: file_search"
      ],
      "metadata": {
        "id": "zsi1IbtbKBjK"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Retrieval** in OpenAI is performed using **\"file search\" tool** when creating your assistant or a theard.\n",
        "\n",
        "(It's declared in the same place than code_interpreter tool if you are familiar with it)\n",
        "\n",
        "What does file_search tool do?\n",
        "* File search tool is the **RAG** (**Retrieval Augmented Generation**) pipeline developed by OpenAI in their **Assistant AI** API.\n",
        "\n",
        "* It enhances the Assistant's capabilities by incorporating external knowledge.\n",
        "\n",
        "* When using this tool, OpenAI will process your documents: parse it, chunk it, create and store embedding in vector store, and then retrieve relevant chunks using both keyword and smeantic search.\n",
        "\n"
      ],
      "metadata": {
        "id": "mbgkMY9VGqFQ"
      }
    },
    {
      "cell_type": "markdown",
      "source": [],
      "metadata": {
        "id": "9UCO4hnGwb2d"
      }
    },
    {
      "cell_type": "markdown",
      "source": [],
      "metadata": {
        "id": "wmRrAUwBwbz2"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "In this tools, OpenAI applies several built-in retrieval best practices to extarct efficently data from a document:\n",
        "\n",
        "*   Rewrites user queries to optimize them for search.\n",
        "*   Breaks down complex user queries into multiple searches it can run in parallel.\n",
        "*   Runs both keyword and semantic searches across both assistant and thread vector stores.\n",
        "*   Reranks search results to pick the most relevant ones before generating the final response.\n",
        "\n",
        "\n",
        "**Vector Store**\n",
        "\n",
        "OpenAI provides also a vector store creation when uploading files with a default chunking Stragey:\n",
        "\n",
        "*   max_chunk_size_tokens: 800 (can be customized)\n",
        "*   chunk_overlap_tokens: 400 (can be customized)\n",
        "*  It's using *text-embedding-3-large* model\n",
        "*  It returns 20 chunks (can be customized) with their relevancy score to the user query"
      ],
      "metadata": {
        "id": "vvyHpRFUKFi5"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "mXHZW1dx4PMA"
      },
      "outputs": [],
      "source": [
        "!pip install openai -q"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "from google.colab import userdata\n",
        "openai_api_key = userdata.get('OPENAI_API_KEY')\n",
        "\n",
        "from openai import OpenAI\n",
        "client = OpenAI(api_key=openai_api_key)"
      ],
      "metadata": {
        "id": "TLR_W-zp4W1-"
      },
      "execution_count": 2,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Upload file to OpenAI"
      ],
      "metadata": {
        "id": "534aHg6c4fD6"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "* Load the file from internet to local\n",
        "* Create a vector store first\n"
      ],
      "metadata": {
        "id": "G_40qkU-6u1B"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!wget \"https://d18rn0p25nwr6d.cloudfront.net/CIK-0001018724/c7c14359-36fa-40c3-b3ca-5bf7f3fa0b96.pdf\" -O amzn_2023_10k.pdf"
      ],
      "metadata": {
        "id": "N4Ikgz-96r4p"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Create a Vector Store"
      ],
      "metadata": {
        "id": "h91mhVlM8aWW"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "vector_store = client.beta.vector_stores.create(name=\"Financial Analyst\")"
      ],
      "metadata": {
        "id": "J3NPhaZa6tbR"
      },
      "execution_count": 12,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "vector_store"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "CWXtIT7RCnNX",
        "outputId": "8b7f59bf-690e-4a10-9a24-66eebb5d2b6f"
      },
      "execution_count": 13,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "VectorStore(id='vs_lIw5lPrbgo8oX8RqwzgFJr0j', created_at=1725822560, file_counts=FileCounts(cancelled=0, completed=0, failed=0, in_progress=0, total=0), last_active_at=1725822560, metadata={}, name='Financial Analyst', object='vector_store', status='completed', usage_bytes=0, expires_after=None, expires_at=None)"
            ]
          },
          "metadata": {},
          "execution_count": 13
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# This is how you can find the chunking values parametered per default in OpenAI:\n",
        "vector_store_files.data[0].chunking_strategy"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "6N7t6YqMYLmb",
        "outputId": "90f5491b-836f-403d-89fc-020aa2a84f10"
      },
      "execution_count": 6,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "StaticFileChunkingStrategyObject(static=StaticFileChunkingStrategy(chunk_overlap_tokens=400, max_chunk_size_tokens=800), type='static')"
            ]
          },
          "metadata": {},
          "execution_count": 6
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# #List vector store with their files\n",
        "# vector_store_files = client.beta.vector_stores.files.list(\n",
        "#   vector_store_id='vs_lIw5lPrbgo8oX8RqwzgFJr0j'\n",
        "# )\n",
        "# print(vector_store_files)"
      ],
      "metadata": {
        "id": "tIHScbVDXQZF"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# # Delete a vector stire with a file\n",
        "# deleted_vector_store_file = client.beta.vector_stores.files.delete(\n",
        "#     vector_store_id='vs_tX1Y8hrMjZZArJOIVrWjkcKk',\n",
        "#     file_id='file-h3Kn24nP0VMt5YRKKPTac3mF'\n",
        "# )\n",
        "# print(deleted_vector_store_file)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "GlHYF0pUREwO",
        "outputId": "65e4c469-436b-4ef1-94b3-1c61a1e56832"
      },
      "execution_count": 67,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "VectorStoreFileDeleted(id='file-h3Kn24nP0VMt5YRKKPTac3mF', deleted=True, object='vector_store.file.deleted')\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Upload the file to OpenAI using the Vector Store"
      ],
      "metadata": {
        "id": "SpgBPeVH8qfV"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Upload files to OpenAI\n",
        "file_paths = [\"./amzn_2023_10k.pdf\"]\n",
        "file_streams = [open(path, \"rb\") for path in file_paths]\n",
        "\n",
        "# Use the upload_and_poll endpoint to upload the files, add them to the vector store, and poll the status of the file batch for completion.\n",
        "# To ensure all contents have finihsed uploading, check the status. Status must now be completed (instead of in_progress)\n",
        "file_batch = client.beta.vector_stores.file_batches.upload_and_poll(\n",
        "  vector_store_id=vector_store.id, files=file_streams\n",
        ")\n",
        "\n",
        "print(file_batch.status)\n",
        "print(file_batch.file_counts)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "qVpmM1-08Jov",
        "outputId": "59e8d720-4a70-4401-b182-a214ff13c8f3"
      },
      "execution_count": 14,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "completed\n",
            "FileCounts(cancelled=0, completed=1, failed=0, in_progress=0, total=1)\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Create an assistant with the vector store information"
      ],
      "metadata": {
        "id": "7V4bZmzl8-Jt"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from openai import OpenAI\n",
        "\n",
        "\n",
        "assistant = client.beta.assistants.create(\n",
        "  name=\"Financial Analyst\",\n",
        "  instructions=\"\"\"\n",
        "    You are a expert skilled financial analyst.\n",
        "    Leverage your extensive knowledge of accounting principles and financial reporting standards to provide accurate and insightful answers to questions regarding financial statements.\n",
        "    Analyze details such as income statements, balance sheets, cash flow statements, and notes to the accounts to support your explanations.\n",
        "  \"\"\",\n",
        "  model=\"gpt-4o-mini\",\n",
        "  tools=[{\"type\": \"file_search\"}], #before April, it was called retrieval\n",
        "  tool_resources={\"file_search\": {\"vector_store_ids\": [vector_store.id]}},\n",
        ")\n",
        "\n",
        "## if you have already created your assistant without vector store and you want to add it to the assistant, you need to update it like this:\n",
        "# assistant = client.beta.assistants.update(\n",
        "#   assistant_id=assistant.id,\n",
        "#   tool_resources={\"file_search\": {\"vector_store_ids\": [vector_store.id]}},\n",
        "# )"
      ],
      "metadata": {
        "id": "I2k5ZSNe9CGV"
      },
      "execution_count": 15,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Create a thread"
      ],
      "metadata": {
        "id": "V-9zYU7f_3hA"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "thread = client.beta.threads.create()"
      ],
      "metadata": {
        "id": "d_YwHbOe_5lg"
      },
      "execution_count": 16,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "\n",
        "You can also create a thread with a vector store, to do that, you need:\n",
        "* To upload the file (raw) to the OpenAI\n",
        "* Than create a thread and attach to the uploaded file\n",
        "* This will create under the hood (in OpenAI) a vector Store.\n",
        "```\n",
        "# Upload the file to OpenAI\n",
        "uploaded_file = client.files.create(\n",
        "  file=open(\"./amzn_2023_10k.pdf\", \"rb\"), purpose=\"assistants\"\n",
        ")\n",
        " ```\n",
        " ```\n",
        "# Create a thread and attach the file to the message\n",
        "thread = client.beta.threads.create(\n",
        "  messages=[\n",
        "    {\n",
        "      \"role\": \"user\",\n",
        "      \"content\": \"What was the net income of 2023?\",\n",
        "      \"attachments\": [\n",
        "        { \"file_id\": uploaded_file.id, \"tools\": [{\"type\": \"file_search\"}] }\n",
        "      ],\n",
        "    }\n",
        "  ]\n",
        ")\n",
        "```\n",
        "```\n",
        "# The thread now has a vector store with that file in its tool resources.\n",
        "print(thread.tool_resources.file_search)\n",
        "```\n",
        "\n",
        "\n",
        "* When creating a run on this theard, the file search tool will query both the vector Store from the assistant and the one from the thread.\n",
        "\n"
      ],
      "metadata": {
        "id": "-YR2QCDuCAv2"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Create a Run"
      ],
      "metadata": {
        "id": "LFNim97TCTtd"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "run = client.beta.threads.runs.create(\n",
        "    thread_id=thread.id,\n",
        "    assistant_id=assistant.id,\n",
        ")\n",
        "print(run.status)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "LDj_0c7D_-Mw",
        "outputId": "1a584c15-bff8-4f7a-ec49-2bbd48865ee8"
      },
      "execution_count": 29,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "queued\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Start chatting"
      ],
      "metadata": {
        "id": "NnUq7Tp2Erhp"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "For the moment I didn't ask any question:"
      ],
      "metadata": {
        "id": "v3liODdVEunS"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "messages = client.beta.threads.messages.list(thread_id=thread.id)"
      ],
      "metadata": {
        "id": "z6pj67cEDCKc"
      },
      "execution_count": 20,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "nbr_messages = len(messages.data)\n",
        "interaction = nbr_messages\n",
        "for message in messages.data:\n",
        "  content = message.content[0]\n",
        "  if content.type == 'text':\n",
        "    print(f\"Interaction #{interaction}\")\n",
        "    print(f\"ROLE={message.role}\")\n",
        "    print(content.text.value)\n",
        "    print(\"\\n\")\n",
        "    print(\"--\"*50)\n",
        "  interaction-=1"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "KscZgd8EDJ3D",
        "outputId": "88cfa1a8-86a8-4500-c979-6a18af6303bc"
      },
      "execution_count": 21,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Interaction #1\n",
            "ROLE=assistant\n",
            "I see that you've uploaded files, but I don't have specific information about their content yet. What specific information or analysis are you looking for regarding the financial statements in these files?\n",
            "\n",
            "\n",
            "----------------------------------------------------------------------------------------------------\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Methods"
      ],
      "metadata": {
        "id": "vh_LPtWcE4mJ"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Here are several methods that will effectively assist in asking questions to assistants (such as creating a message within a thread or initiating a run) and in parsing and understanding the responses provided by the assistant."
      ],
      "metadata": {
        "id": "udki3yhME5_R"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def send_message(message_user, my_thread_id, my_assistant_id):\n",
        "  \"\"\"\n",
        "  Create a message with the user request, and then create a run that will ask the Assistant to run on the requested question:\n",
        "  \"\"\"\n",
        "  params = {\n",
        "      \"thread_id\": my_thread_id,\n",
        "      \"role\": \"user\",\n",
        "      \"content\": message_user,\n",
        "  }\n",
        "  thread_message = client.beta.threads.messages.create(**params)\n",
        "\n",
        "  run = client.beta.threads.runs.create(\n",
        "      thread_id=my_thread_id,\n",
        "      assistant_id=my_assistant_id,\n",
        "  )\n",
        "  return run\n",
        "\n",
        "def get_response(my_thread_id):\n",
        "  return client.beta.threads.messages.list(thread_id=my_thread_id)\n",
        "\n",
        "def print_messages(messages):\n",
        "  nbr_messages = len(messages.data)\n",
        "  interaction = nbr_messages\n",
        "  for message in messages.data:\n",
        "    content = message.content[0]\n",
        "    print(f\"Interaction #{interaction} and {content.type}\")\n",
        "    if content.type == 'text':\n",
        "      print(f\"ROLE={message.role}\")\n",
        "      print(content.text.value)\n",
        "      print(\"\\n\")\n",
        "      print(\"--\"*50)\n",
        "    else:\n",
        "      print(content.type)\n",
        "    interaction-=1"
      ],
      "metadata": {
        "id": "NSuoVmudDWMi"
      },
      "execution_count": 24,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Question: \"What was the net income of 2023?\""
      ],
      "metadata": {
        "id": "_gE0LpKJFweQ"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "message_user = \"What was the net income of 2023?\"\n",
        "run = send_message(message_user, thread.id, assistant.id)\n",
        "print(run.id, run.status)\n",
        "run.tools"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "w8IlSpePEQWS",
        "outputId": "114cad7b-6ea6-4448-f37b-c709c75daad4"
      },
      "execution_count": 46,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "run_Lj3Wky79Kbv1kbTxb5BGveSE queued\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[FileSearchTool(type='file_search', file_search=FileSearch(max_num_results=None, ranking_options=FileSearchRankingOptions(ranker='default_2024_08_21', score_threshold=0.0)))]"
            ]
          },
          "metadata": {},
          "execution_count": 46
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Give the assistant some time to think (in seconds):"
      ],
      "metadata": {
        "id": "tfp5RPKaEQIy"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "messages = get_response(thread.id)\n",
        "print_messages(messages)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "2OtBpfS-Dqpb",
        "outputId": "ecfa0baa-e2f2-462f-cef9-3d1ff61709c0"
      },
      "execution_count": 25,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Interaction #3 and text\n",
            "ROLE=assistant\n",
            "The net income for the year 2023 was $30,425 million (or $30.4 billion)【5:1†source】.\n",
            "\n",
            "\n",
            "----------------------------------------------------------------------------------------------------\n",
            "Interaction #2 and text\n",
            "ROLE=user\n",
            "What was the net income of 2023?\n",
            "\n",
            "\n",
            "----------------------------------------------------------------------------------------------------\n",
            "Interaction #1 and text\n",
            "ROLE=assistant\n",
            "I see that you've uploaded files, but I don't have specific information about their content yet. What specific information or analysis are you looking for regarding the financial statements in these files?\n",
            "\n",
            "\n",
            "----------------------------------------------------------------------------------------------------\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Get the specific information about the retrieved chunks"
      ],
      "metadata": {
        "id": "8GMloEMQOHsy"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "run_steps = client.beta.threads.runs.steps.list(\n",
        "    thread_id=thread.id,\n",
        "    run_id='run_mPnuN4vyVaoYkU9IcpkQaSi1'\n",
        "    # run_id = 'run_Lj3Wky79Kbv1kbTxb5BGveSE'\n",
        ")\n",
        "\n",
        "print(run_steps)"
      ],
      "metadata": {
        "id": "6MaFX-ecMImF"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "for step in run_steps.data:\n",
        "  print(step.id, step.step_details)\n",
        "        # .tool_calls[0].file_search.results[0].content)\n",
        "\n",
        "# Results\n",
        "# step_GvSdx4K5BJUlFZruVTpwo3OP MessageCreationStepDetails(message_creation=MessageCreati...\n",
        "# step_oPkFeY8F1ny1i0zv6rWaMLyp ToolCallsStepDetails(tool_calls=[FileSearchToolCall(id..."
      ],
      "metadata": {
        "id": "9l6JToQgMWc9"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Advanced Specifities"
      ],
      "metadata": {
        "id": "khsQfr7hWMbd"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Get the relevance score for each chunk"
      ],
      "metadata": {
        "id": "FTv81e-RPZpi"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "By taking as example the previous question and answer:"
      ],
      "metadata": {
        "id": "Rr7f-xbxWP3M"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "In the last step \"ToolCallsStepDetails\" you can get the relevance score to the user's query for each of the 20 retrieved chunks. This value can be modified file_search.max_num_results to 5 if needed when creating the asstsistant or the run."
      ],
      "metadata": {
        "id": "Ob1Xbs4LONdi"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# for step = 'step_oPkFeY8F1ny1i0zv6rWaMLyp'\n",
        "step.step_details.tool_calls[0].file_search.results\n",
        "# Even if the content is None, here, see script below to know how to get the adequate chunk raw data.\n",
        "\n",
        "#Results\n",
        "# [FileSearchResult(file_id='file-JmeROw0FJQT7vjOcvHHUHzT8', file_name='amzn_2023_10k.pdf', score=0.8368480370110275, content=None),\n",
        "#  FileSearchResult(file_id='file-JmeROw0FJQT7vjOcvHHUHzT8', file_name='amzn_2023_10k.pdf', score=0.8209594035063414, content=None),\n",
        "# ....\n",
        "#  FileSearchResult(file_id='file-JmeROw0FJQT7vjOcvHHUHzT8', file_name='amzn_2023_10k.pdf', score=0.3300425731103375, content=None),\n",
        "#  FileSearchResult(file_id='file-JmeROw0FJQT7vjOcvHHUHzT8', file_name='amzn_2023_10k.pdf', score=0.3284635772545966, content=None)]"
      ],
      "metadata": {
        "id": "p2eNfd0qM8Vi"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Get the details chunks retrieved from OpenAI to answer to your question:"
      ],
      "metadata": {
        "id": "s6D0Qq4CQIA3"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Get the run and the step ids. For step id, it's the one that retrieved the chunks, with object ToolCallsStepDetails."
      ],
      "metadata": {
        "id": "ZP7zzE9MQjkm"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "run_step = client.beta.threads.runs.steps.retrieve(\n",
        "    thread_id=thread.id,\n",
        "    run_id='run_mPnuN4vyVaoYkU9IcpkQaSi1',\n",
        "    step_id=\"step_oPkFeY8F1ny1i0zv6rWaMLyp\",\n",
        "    include=[\"step_details.tool_calls[*].file_search.results[*].content\"]\n",
        ")\n",
        "\n",
        "print(run_step)"
      ],
      "metadata": {
        "id": "rB96N9U7LjL1"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "run_step.step_details.tool_calls[0].file_search.results[0].content\n",
        "\n",
        "\n",
        "# [FileSearchResultContent(text='Consolidated\\nNet sales $ 469,822 $ 513,983 $ 574,785 \\nOperating expenses 444,943 501,735 537,933 \\nOperating income 24,879 12,248 36,852 \\n...."
      ],
      "metadata": {
        "id": "8FqDIULYQCIO"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Chunk Strategy"
      ],
      "metadata": {
        "id": "Y6asAC7LWnkk"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "2 things that you can modify:\n",
        "* max_chunk_size_tokens\n",
        "* chunk_overlap_tokens\n",
        "\n",
        "When adding files to the vector store, you can modify these 2 values:"
      ],
      "metadata": {
        "id": "oszIH9UsWi5D"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# vector_store_file = client.beta.vector_stores.files.create(\n",
        "#   vector_store_id='vs_lIw5lPrbgo8oX8RqwzgFJr0j',\n",
        "#   file_id='file-JmeROw0FJQT7vjOcvHHUHzT8',\n",
        "#   chunking_strategy={'type':'static', 'static':{'max_chunk_size_tokens':1240, 'chunk_overlap_tokens':500}}\n",
        "# )\n",
        "# print(vector_store_file)"
      ],
      "metadata": {
        "id": "dKvGIlywXAAL"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Summarize"
      ],
      "metadata": {
        "id": "Blkj36kYZXK4"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Question: \"Can you summarize the document?\""
      ],
      "metadata": {
        "id": "BQJ1PuCbF6in"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "message_user = \"Can you summarize the document?\"\n",
        "send_message(message_user, thread.id, assistant.id)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "o-4y1r6jD6JK",
        "outputId": "6bd2a70f-527b-4120-e3ff-15f3b21bb9d1"
      },
      "execution_count": 26,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "Run(id='run_mTrttRcyqi2CE6P4su8RbIO5', assistant_id='asst_z1ap2fxFg6yxCzKKvhs8DOtc', cancelled_at=None, completed_at=None, created_at=1725822920, expires_at=1725823520, failed_at=None, incomplete_details=None, instructions='\\n    You are a expert skilled financial analyst. \\n    Leverage your extensive knowledge of accounting principles and financial reporting standards to provide accurate and insightful answers to questions regarding financial statements. \\n    Analyze details such as income statements, balance sheets, cash flow statements, and notes to the accounts to support your explanations.\\n  ', last_error=None, max_completion_tokens=None, max_prompt_tokens=None, metadata={}, model='gpt-4o-mini', object='thread.run', parallel_tool_calls=True, required_action=None, response_format='auto', started_at=None, status='queued', thread_id='thread_Sp02A7D6nY5YEgsmxb1NMZDa', tool_choice='auto', tools=[FileSearchTool(type='file_search', file_search=FileSearch(max_num_results=None, ranking_options=FileSearchRankingOptions(ranker='default_2024_08_21', score_threshold=0.0)))], truncation_strategy=TruncationStrategy(type='auto', last_messages=None), usage=None, temperature=1.0, top_p=1.0, tool_resources={})"
            ]
          },
          "metadata": {},
          "execution_count": 26
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "messages = get_response(thread.id)\n",
        "print_messages(messages)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "ZGJQ3HpGD-4C",
        "outputId": "653bc31a-13b1-40e3-fa47-30867e489972"
      },
      "execution_count": 28,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Interaction #5 and text\n",
            "ROLE=assistant\n",
            "Here’s a summary of the document, which appears to be a 10-K filing for Amazon.com, Inc. for the year ended December 31, 2023:\n",
            "\n",
            "1. **Financial Performance**:\n",
            "   - Amazon reported total net sales of $574.785 billion for 2023, representing a 12% increase from the prior year. \n",
            "   - The company achieved a net income of $30.425 billion in 2023, a significant recovery from a net loss of $2.722 billion in 2022【5:1†source】【9:15†source】.\n",
            "   - Operating income improved to $36.852 billion from $12.248 billion in the previous year【9:9†source】.\n",
            "   - Earnings per share for 2023 were $2.95 (basic) and $2.90 (diluted)【9:15†source】.\n",
            "\n",
            "2. **Sales and Segments**:\n",
            "   - Sales growth was driven mainly by increased unit sales, third-party seller services, and advertising sales【9:8†source】.\n",
            "   - The breakdown of sales showed that North America contributed 61%, international sales accounted for 23%, and AWS (Amazon Web Services) made up 16% of total sales【9:8†source】.\n",
            "\n",
            "3. **Operating Expenses**:\n",
            "   - Operating expenses totaled $537.933 billion, an increase from $501.735 billion in 2022. Notably, technology and infrastructure costs grew significantly【9:8†source】【9:9†source】.\n",
            "\n",
            "4. **Taxation**:\n",
            "   - The provision for income taxes was $7.120 billion, demonstrating the company’s substantial tax obligations arising from its income generation【9:15†source】【9:7†source】.\n",
            "\n",
            "5. **Capital Expenditures and Cash Flow**:\n",
            "   - Amazon reported strong cash flow from operations of $84.946 billion for 2023 and a positive free cash flow of $36.813 billion【9:1†source】.\n",
            "\n",
            "6. **Market Exposure**:\n",
            "   - The report also discusses Amazon's exposure to market risks, including fluctuations in foreign exchange rates and interest rates【9:9†source】【9:6†source】.\n",
            "\n",
            "This overview provides key insights into Amazon’s financial health, operational efficiency, and overall performance for 2023 as reflected in their 10-K filing. If you need more detailed information on specific sections or figures, feel free to ask!\n",
            "\n",
            "\n",
            "----------------------------------------------------------------------------------------------------\n",
            "Interaction #4 and text\n",
            "ROLE=user\n",
            "Can you summarize the document?\n",
            "\n",
            "\n",
            "----------------------------------------------------------------------------------------------------\n",
            "Interaction #3 and text\n",
            "ROLE=assistant\n",
            "The net income for the year 2023 was $30,425 million (or $30.4 billion)【5:1†source】.\n",
            "\n",
            "\n",
            "----------------------------------------------------------------------------------------------------\n",
            "Interaction #2 and text\n",
            "ROLE=user\n",
            "What was the net income of 2023?\n",
            "\n",
            "\n",
            "----------------------------------------------------------------------------------------------------\n",
            "Interaction #1 and text\n",
            "ROLE=assistant\n",
            "I see that you've uploaded files, but I don't have specific information about their content yet. What specific information or analysis are you looking for regarding the financial statements in these files?\n",
            "\n",
            "\n",
            "----------------------------------------------------------------------------------------------------\n"
          ]
        }
      ]
    }
  ]
}